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Abstract 

We give two different characterizations of the type of dimensionality-reducing map II that 
can be used for spectral error approximate matrix multiplication (AMM). Both imply a random 
data-oblivious II with m = 0(r/E 2 ) rows suffices, where f is the maximum stable rank , i.e. 
squared ratio of Frobenius and operator norms, of the matrices being multiplied. This answers 
the main open question of |MZ111 IKVZ14I . and is optimal for any random oblivious map. 

Both characterizations apply to a general class of random II, and one even to deterministic II. 

Recall an (e, 5, d)-oblivious subspace embedding (OSE) distribution T> over matrices II £ R mxn 
is such that for any d-dimensional linear subspace E of R", P(||(II[/) T (n[/) — 7j| > e) < S, where 
the columns of U form an orthonormal basis for E. In one characterization, we show if this tail 
bound was established via the moment method, then to obtain AMM it suffices that T> be an 
(e, 5, 2r)-OSE. That is, we show being an OSE for dimension (i.e. rank) k implies black box AMM 
for matrices of stable rank k. Once this is shown, our main result is then just a simple corollary 
of the fact that subgaussian maps with m = P((d + log(l/<5))/e 2 ) rows are (e, 6, d)-OSE’s. Also, 
for all known OSE’s, the best analyses indeed are via the moment method (or tools such as 
matrix Chernoff, which themselves also imply moment bounds). Thus our theorem can be 
applied to a much more general class of sketching matrices than just the subgaussian sketches in 
[ MZll : , iKVZ14l . in addition to achieving better bounds. This inclu des fast su bspace embeddings 
such as the Subsampled Randomized Hadamard Transform |Sar06llLWM + 07i or sparse subspace 
embeddings [CW131 |MM13 > S NN13 , Cohl61 , or even constructions that may be developed in the 
future, to show that rank bounds in their analyses in previous work can automatically be replaced 
with stable rank. Our second characterization identifies certain deterministic conditions which 
if satisfied imply the AMM guarantee. We show these conditions are sufficiently precise to yield 
optimal results for subgaussian maps and even a deterministic II such as the truncated SVD. 

Our main theorem, via connections with spectral error matrix multiplication proven in pre¬ 
vious work, implies quantitative improvements for approximate least squares regression and low 
rank approximation |Sar06| . and implies faster low rank approximation for popular kernels in 
machine learning such as the gaussian and Sobolev kernels. Our main result has also already been 
applied to improve dimensionality reduction guarantees for fc-means clustering |CEM + 15] , and 
also implies new results for nonparametric regression when combined with results in JYPW151 . 

Lastly, we point out a minor but interesting observation that the proof of the “BSS” deter¬ 
ministic row-sampling result of [ IBSS121 can be modified to show that for any matrices A, B of 
stable rank at most r, one can achieve the spectral norm guarantee for approximate matrix mul¬ 
tiplication of A T B using a deterministic sampling matrix with 0(r/e 2 ) non-zero entries which 
can be found in polynomial time. The original result of |BSSl2l was for rank instead of stable 
rank. Our observation leads to a stronger version of a main theorem of |KMST10j . 
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1 Introduction 


Much recent work has successfully utilized randomized dimensionality reduction techniques to speed 
up solutions to linear algebra problems, with applications in machine learning, statistics, optimiza¬ 
tion, and several other domains; see the recent monographs [HMTlll IMahlll IWool4] for more 
details. In our work here, we give new spectral norm guarantees for approximate matrix multipli¬ 
cation (AMM). Aside from AMM being interesting in its own right, it has become a useful primitive 
in the literature for analyzing algorithms for other large-scale linear algebra problems as well. We 
show applications of our new guarantees to speeding up standard algorithms for generalized regres¬ 
sion and low-rank approximation problems. We also describe applications of our results to /c-means 
clustering (discovered in CEM + 15] ) and nonparametric regression [YPW15 1. 

In AMM we are given A, B each with a large number of rows n, and the goal is to compute 
some matrix C such that ||C — A T B\\x is “small”, for some norm || • \\x • Furthermore, we would 
like to compute C much faster than the usual time required to exactly compute A T B. 

Work on randomized methods for AMM began with [DKM06] . which focused on || ■ ||x = || • \\p, 
Frobenius norm error. They showed by picking an appropriate sampling matrix II € R mxn , 


i.e. 


||(IL4) t (ILB) - A t B\\ f < e 
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(1) 


with good probability, if m = Tl(l/e 2 ). By a sampling matrix , we mean the rows of II are indepen¬ 
dent, and each row is all zero except for a 1 in a random location according to some appropriate 
distribution. If A £ R nxd and B € R nxp , note (LL4) T (IIB) can be computed in 0(mdp) time once 
IIA and TIB are formed, as opposed to the straightforward 0(ndp) time to compute A 1 B. 

The Frobenius norm error guarantee of Eq. (JT]) was also later achieved in |Sar06l Lemma 6] via 
a different approach, with some later optimizations to the parameters in [ IKN141 Theorem 6.2], The 
approach of Sarlos was not via sampling, but rather to use a matrix II drawn from a distribution 
satisfying an “oblivious Johnson-Lindenstrauss (JL)” guarantee, i.e. a distribution V over R mxn 
satisfying the following condition for some e, 5 € (0,1/2): 

Vx € R n , P (|||IIx||i - ||x||!| > e||x||l) < 6 . (2) 


Such a matrix II can be taken with m = 0(e~ 2 log(l/<5)) [ JL84j . Furthermore, one can take II 
to be a Fast JL transform [AC09] (or any of the follow-up improvements ( ALU . KW11 . NPW14 , 
IBoul4l ERE]) or a sparse JL transform [DKS101IKN 14] to speed up the computation of II A and 
LliJ. One could also use the Thor up-Zhang sketch [TZ12] combined with a certain technique of 
[LBKW14| (see |Wool41 Theorem 2.10] for details) to efficiently boost success probability. 

Other than Frobenius norm error, the main other error guarantee investigated in previous work 
is spectral error. That is, we would like ||C — A T 5|| to be small, where \\M\\ denotes the largest 
singular value of M. If one is interested in applying A r B to some set of input vectors then this type 
of error is the most meaningful, since ||C — A r 5|| being small is equivalent to \\Cx\\ ~ ||A T Bx|| 
for any x. The first work along these lines was again by |DKMD6] . who gave a procedure based on 
entry-wise sampling of the entries of A and B. The works jDMMflfillSSll] showed that row-sampling 
according to leverage scores also provides the desired guarantee with few samples. 

Then [Sarflbj . combined with a quantitative improvement in [CW13] , showed that one can 
take a II drawn from an oblivious JL distribution with d = 2~®( r ) where r(-) denotes rank and 
r = r(A) + r(B). Then for II with m = 0((r + log(l/d))/e 2 ), with probability at least 1 — 6 over II, 

mA) T (UB)-A T B\\<£\\A\\\\B\\. (3) 
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As we shall see shortly via a very simple lemma (Lemma [T]), a sufficient deterministic condition 
implying Eq. <[3|) is that II is an O^e)-sub space embedding for the r-dimensional subspace spanned 
by the columns of A, B. The notion of a subspace embedding was introduced by ISarOGI . 

Definition 1. II is an e-subspace embedding for U € M n,xr , U T U = I, if II satisfies Eq. Q with 
A = B = U, i.e. ||(nf7) T (II[/) — 7|| < e. This is equivalent to Mx € R r , (1 — e)||x||| < ||nC/x||| < 
(1 + e)||x|| 2 , i.e. II preserves norms of all vectors in the subspace spanned by the columns U. 

An (e, 5, r)-oblivious subspace embedding (OSE) is a distribution V overM. mxn such that 

MU £ M nxr , U T U = I, P (||(IK/) T (m7) — JH > e) < 5. 

n~v 

Fast subspace embeddings II, i.e. such that the products BA and II B can be computed quickly, 
are known using variants on the Fast JL transform such as the Subsampled Randomized Hadamard 
Transform (SRHT) [SarOfil lLWM + 07l ITrolll ILDFII13] (also see a slightly improved analysis of the 
SRHT in Section DOl) or via sparse subspace embeddings |CW13llMM13llNN131ILMP13llCLM + 15 
ICohlfij . In most applications it is important to have a fast subspace embedding to shrink the time it 
takes to transform the input data to a lower-dimensional form. The SRHT is a construction of a n 
such that n A can be computed in time 0(nd\ogn) (see Section [A. 2l for details of the construction). 
The sparse subspace embedding constructions have some parameter m rows and exactly s non-zero 
entries per column, so that UA can be computed in time 0(s -nnz(A)), where nnz(-) is the number 
of non-zero entries, and there is a tradeoff in the upper bounds between m and s. 

An issue addressed by the work of |MZ11| is that of robustness. As stated above, achieving 
Eq. d3j) requires n be a subspace embedding for an r-dimensional subspace. However, consider the 
case when A (and similarly for B ) is of high rank but can be expressed as the sum of a low-rank 
matrix plus high-rank noise of small magnitude, i.e., A = A + Ea for A of rank r{A) -C r, and 
where \\Ea || is very small but Ea has high (even full) rank. One would hope the noise could be 
ignored, but standard results require n to have a number of rows at least as large as r, regardless 
of how small the magnitude of the noise is. Another case of interest (as we will see in Section [3]) 
is when A and B are each of high rank, but their singular values decay at some appropriate rate. 
As discussed in Section [3J in several applications where AMM is not the final goal but rather is 
used as a primitive in analyzing an algorithm for some other problem (such as fc-means clustering 
or nonparametric regression), the matrices that arise do indeed have such decaying singular values. 

The work [MZ11| remedied this by considering the stable ranks r(A),r(B) of A and B. Define 
r(A ) = ||A||^/||A|| 2 . Note f(A ) < r(A ) always, but can be much less if A has a small tail of singular 
values. Let r denote r(A) + r(B). Among other results, |MZ11] showed that to achieve Eq. Q with 
good probability, one can take n to be a random (scaled) sign matrix with either m = fl(f/e 4 ) or 
m = H(r log(d + p)/e 2 ) rows. As noted in follow-up work |KVZ14j . both the 1/e 4 dependence and 
the log(d-fp) factor are undesirable. In their data-driven low dimensional embedding application, 
they wanted a dimension m independent of the original dimensions, which are assumed much larger 
than the stable rank, and also wanted lower dependence on 1/e. To this end, [KVZ14] defined the 
nuclear rank as nr(A) = ||A||*/||A|| and showed m = H(nr/e 2 ) rows suffice for nr = nr(A)-\-nr(B). 
Here ||A||* is the nuclear norm, i.e., sum of singular values of A. Since ||A|||, is the sum of squared 
singular values, it is straightforward to see that nr(A ) > r(A) always. Thus there is a tradeoff: the 
stable rank guarantee is worsened to nuclear rank, but dependence on 1/e is improved to quadratic. 

We show switching to the weaker nr guarantee is unnecessary by showing quadratic dependence 
on 1/e holds even with stable rank. This answers the main open question of fMZlll 1KVZ14| . 
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To state our results in a more natural way, we rephrase our main result to say that we achieve 


H(rL4) T (ILB) - A t B\\ < £] j (pp + (j|5|| 2 + (4) 

for an arbitrary k > 1, and we do so by using subspace embeddings for 0(A;)-dimensional subspaces 
in a certain black box way (which will be made precise soon) regardless of the ranks of A, B. 

Remark 1. Note that our previously stated main contribution is equivalent, since one could 
set k = r(A) + r(B) to arrive at the conclusion that subspace embeddings for 0(r )-dimensional 
subspaces yield the guarantee in Eq. <(3]). Alternatively one could obtain the Eq. (jU guarantee via 
Eq. Q with error parameter s' = @(e • min{l, yj(f(A) ■ r(B))/k}). 

Henceforth, we use the following definition. 

Definition 2. For conforming matrices A T ,B, we say n satisfies the (k, e)-approximate spectral 
norm matrix multiplication property ((fe,e)-AMM) for A, B if Eq. (JU) holds. If II is random and 
satisfies (k,e)-A MM with probability 1 —<5 for any fixed A, B, then we sayYL satisfies ( k,e,5)-AMM. 

Our main contribution: We give two different characterizations for n supporting (k. e)-AMM, 
both of which imply (fc,e,i)-AMM n having m = 0((k + log(l /5))/e 2 ) rows. The first char¬ 
acterization applies to any OSE distribution for which a moment bound has been proven for 
|| (nf7) T (nt/) — 1 1| (which is true for the best analyses of all known OSE’s). In this case, we show 
a black box theorem: any (e, 5, 2/c)-OSE provides (A;, e, <5)-AMM. Since matrices with subgaussian 
entries and m = fl((fc-|-log(l/(5))/e 2 ) are (e, 5, 2/c)-OSE’s, our originally stated main result follows. 
This result is optimal, since |NN14| shows any randomized distribution over n with m rows having 
the (k, e, <5)-AMM property must have m = f l{(k + log(l/<5))/e 2 ) (the hard instance there is when 
A = B = U has orthonormal columns, and thus rank and stable rank are equal). 

Our second characterization identifies certain deterministic conditions which, if satisfied by n, 
imply the desired (fc,e)-AMM property. These conditions are of the form: (1) n should preserve a 
certain set of 0{ log(l/e)) different subspaces of varying dimensions (all depending on k. e and not 
on the ranks of A, B) with varying distortions, and (2) for a certain two matrices in our analysis, 
left-multiplication by n should not increase their operator norms by more than an 0(1) factor. 
These conditions are chosen carefully so that matrices with subgaussian entries and m = H(k/e 2 ) 
satisfy all conditions simultaneously with high probability, again thus proving our main result while 
also suggesting that the conditions we have identified are the “right” ones. 

Due to the black box reliance on the subspace embedding primitive in our proofs, n need not 
only be a subgaussian map. Thus not only do we improve on m compared with previous work, 
but also in terms of the general class of n our result applies to. For example given our first 
characterization, not only does it suffice to use a random sign matrix with H(k/e 2 ) rows, but in 
fact one can apply our theorem to more efficient subspace embeddings such as the SRHT or sparse 
subspace embeddings, or even constructions discovered in the future. That is, one can automatically 
transfer bounds proven for the subspace embedding property to the (k, e)-AMM property. Thus, for 
example, the best known SRHT analysis (in our appendix, see Theorem [9|) implies (A;, e, 5)-AMM 
for m = Q((k + log(l/(e<5)) \og(k/d))/£ 2 ) rows. For sparse subspace embeddings, the analysis in 
[Cohl6] implies m = H{k\og{k/5)/e 2 ) suffices with s = 0{\.og{k/5)/e) non-zeroes per column of n. 
The only reason for the log k loss in m for these particular distributions is not due to our theorems, 
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but rather due to the best analyses for the simpler subspace embedding property in previous work 
already incurring the extra log A; factor (note being a subspace embedding for a A-dimensional 
subspace is simply a special case of (A, e)-AMM where A = B = U has k orthonormal columns). 
In the case of the SRHT, this extra log k factor is actually necessary [Troll] : for sparse subspace 
embeddings, it is conjectured that the log k factor can be removed and that m = n((A’+log(l/5))/e 2 ) 
actually suffices to obtain an OSE [NN131 Conjecture 14]. We also discuss in Remark [3] that one 
can set II to be IR • IR where IR has subgaussian entries with 0{k/e 2 ) rows, and IR is some 
other fast OSE (such as the SRHT or sparse subspace embedding), and thus one could obtain the 
best of both worlds: (1) n has 0(kje 2 ) rows, and (2) can be applied to any A £ M nx,i in time 
T + 0(km'd/£ 2 ), where T is the (fast) time to apply A 2 to A, and m' is the number of rows of n 2 . 
For example, by appropriate composition as discussed in Remark 01 n can have 0(k/e 2 ) rows and 
support multiplying AA for A £ M. nxd in time 0(nnz(A)) + 0(e~°P\k 3 + k 2 d)). 

We also observe the proof of the main result of [BSS12] can be modified to show that given any 
A, B each with n rows, and given any e £ (0,1/2), there exists a diagonal matrix n £ M nxri with 
0(k/e 2 ) non-zero entries, and that can be computed by a deterministic polynomial time algorithm, 
achieving (A,e)-AMM. The original work of [BSS12] achieved Eq. Q with m = 0(r/e 2 ) for r being 
the sum of ranks of A, B. The work [BSS12] stated their result for the case A = B, but the general 
case of potentially unequal matrices reduces to this case; see Section 01 Our observation also turns 
out to yield a stronger form of [KMSTlOl Theorem 3.3]; also see Section 01 


As mentioned, aside from AMM being interesting on its own, it is a useful primitive widely 
used in analyses of algorithms for several other problems, including A-means clustering [BZMD151 
|CEM + 15] . nonparametric regression [YPW15] , linear least squares regression and low-rank ap¬ 
proximation [Sar06j . approximating leverage scores [DMMW12] . and several other problems (see 
[Wool4] for a recent summary). For all these, analyses of correctness for algorithms based on 
dimensionality reduction via some n rely on n satisfying AMM for certain matrices in the analysis. 

After making certain quantitative improvements to connections between AMM and applications, 
and combining them with our main result, in Section [3] we obtain the following new results. 

1. Generalized regression: Given A £ W nxd and B £ M nxp , consider the problem of comput¬ 
ing X* = argminx^dxp || AX — B ||. It is standard that X* = (A T A) + A T B where (-) + is the 
Moore-Penrose pseudoinverse. The bottleneck here is computing A T A, taking 0(nd?) time. 
A popular approach is to instead compute X = ((nA) T (nA)) + (nA) T nR, i.e., the minimizer 
of UnAX" — nB||. Note that computing (nA) T (nA) (given nA) only takes a smaller 0(md 2 ) 
amount of time. We show that if n satisfies (A’, 0( v / e))-AMM for U A ,P A B, aR d is a ^ so an 
0(l)-subspace embedding for a certain r(A)-dimensional subspace (see Theorem [3]), then 

\\AX-B\\ 2 < (1 + e)\\P a B — B\\ 2 + (e/k)\\P A B — B\\ 2 F 


where P A is the orthogonal projection onto the column space of A, P A = I — P A , and U A has 
orthonormal columns forming a basis for the column space of A. The punchline is that if the 
regression error P A B has high actual rank but stable rank only on the order of r(A), then we 
obtain multiplicative spectral norm error with n having fewer rows. Generalized regression 
is a natural extension of the case when B is a vector, and arises for example in Regularized 
Least Squares Classification, where one has multiple (non-binary) labels, and for each label 
one creates a column of B ; see e.g. CLL + l(j] for this and variations. 
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2. Low-rank approximation: We are given A £ W nxd and integer A; > 1. and we want to 
compute Ak = argmin r ^ X )<k 11-A — Y||. The Eckart-Young theorem implies Ak is obtained by 
truncating the SVD of A to the top k singular vectors. The standard way to use dimensionality 
reduction for speedup, introduced in [SarOfij . is to let S = HA then compute A = APg. Then 
return Ak, the best rank-A; approximation of A, instead of A (it is known A can be computed 
more efficiently than Ak ; see [CWA91 Lemma 4.3]). We show if II satisfies (A, 0(y / e))-AMM 
for Uk and A — Ak, and is a (l/2)-subspace embedding for the column space of Ak, then 

11 At ~ A| 2 < (1 + £ )||^4 — Ak\\ 2 + {e/k)\\A — Afc||p. 

The punchline is that if the stable rank of the tail A — Ak is on the same order as the rank 
parameter k, then standard algorithms from previous work for Frobenius multiplicative error 
actually in fact also provide spectral multiplicative error. This property indeed holds for any k 
for popular kernel matrices in machine learning such as the gaussian and Sobolev kernels (see 
jEHVllj and Examples 2 and 3 of |YPW15] ), and low-rank approximation of kernel matrices 
has been applied to several machine learning problems; see |GM13] for a discussion. 

We also explain in Section [3] how our result has already been applied in recent work on dimen¬ 
sionality reduction for A:-means clustering jCLM + 15] , and how it generalizes results in [ YPW15] on 
dimensionality reduction for nonparametric regression to use a larger class of embeddings II. 

1.1 Preliminaries and notation 

We frequently use the singular value decomposition (SVD). For a matrix A £ M nxd of rank r, 
consider the compact SVD A = UaHaYa w h ere Ua € M nxr and Va £ M cixr each have orthonormal 
columns, and E^ is diagonal with strictly positive diagonal entries (the singular values of A). We 
assume (E^)^ > {T.A)j,j for i < j. We let Pa = V/i Vj denote the orthogonal projection operator 
onto the column space of A. We use span(A) to refer to the subspace spanned by A’s columns. 

Often for a matrix A we write Ak as the best rank-A: approximation to A under Frobenius or 
spectral error (obtained by writing the SVD of A then setting all (E^)^ to 0 for i > k). We often 
denote A — Ak as Aj.. For matrices with orthonormal columns, such as Ua, (UjCjk denotes the nx k 
matrix formed by removing all but the first k columns of U. When A is understood from context, 
we often write UT,V t instead of UaHaVJ^, and Uk to denote (IVOfc (and E*. for (E^)*., etc.). 

2 Analysis of matrix multiplication for stable rank 

First we record a simple lemma relating subspace embeddings and AMM. 

Lemma 1. Let E = span{A, B}, and let II be an e-subspace embedding for E. Then Eq. ([3]) holds. 

Proof. First, without loss of generality we may assume ||A|| = ||S|| = 1 since we can divide both 
sides of Eq. ([3]) by ||A|| • ||S||. Let U be a matrix whose columns form an orthonormal basis for E. 
Then note for any x,y we can write Ax = Uw,By = Uz where \\w\\ < ||x||, ||z|| < ||y||. Then 

!|(IL4) T (ILB) - A t B\\ = sup \{UAx,UBy)-(Ax,By)\ 

\\ x \\=\\y \\ =1 

= sup | (UUz, n Uw) - {Uz, Uw) | 
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rows. 


= mu) T (uu)-i\\ 

< £ 


Lemma CD implies that if A, B each have rank at most r, it suffices for II to have fl(r/e 2 ) 

In the following two subsections, we give two different characterizations for II to provide (k,e)- 
AMM, both only requiring II to have Q(k/e 2 ) rows, independent of r. 

2.1 Characterization for (k, e, 5)-AMM via a moment property 

Here we provide a way to obtain {k, e)-AMM for any n whose subspace embedding property has been 
established using the moment method, e.g. sparse subspace embeddings |MM131 INN131 ICohl6j . 
dense subgaussian matrices as analyzed in Section fA. 11 or even the SRHT as analyzed in Section fA.21 
Our approach in this subsection is inspired by the introduction of the “JL-moment property” in 
[KN14] to analyze approximate matrix multiplication with Frobenius error. The following is a 
generalization of |KN14l Definition 6.1], which was only concerned with d = 1. 

Definition 3. A distribution T> over W nxn has (e,5,d,£)- OSE moments if for all matrices U € 
l>nxd with orthonormal columns, 


E \\(UU) T (UU)-I\\ e <e e -5 
n~x>" 11 

Note that this is just a special case of bounding the expectation of an arbitrary function of 
\\{UU) T (UU) — 7||. The arguments below will actually apply to any nonnegative, convex, increas¬ 
ing function of ||(nt/) T (ni/) — 1 1| 2 , but we restrict to moments for simplicity of presentation. The 
acronym “OSE” refers to oblivious subspace embedding , a term coined in |NN13] to refer to dis¬ 
tributions over n yielding a subspace embedding for any fixed subspace of a particular bounded 
dimension with high probability. We start with a simple lemma. 

Lemma 2. Suppose T> satisfies the (e, 5,2d, £)-OSE moment property and A,B are matrices with 
(1) the same number of rows, and (2) sum of ranks at most 2d. Then 

E ||(nA) T (nR) - A t B\\ £ < £ £ \\A\\ £ \\B\\ l ■ 5 

Proof. First, we apply LemmaQ]to A and B, where U forms an orthonormal basis for the subspace 
span{columns(A), columns(H)}, showing that 

\\(ua) t (ub) - a t b\\ < ||(n[/) T (m/)-/|| ||a||||r||. 

Therefore 

E ||(HA) T (nH) - A T B\f < E \\(UU) T (UU) - l|| £ \\A\\ e \\B\\* < e £ \\A\\ e \ 

II 11 r^j U 

Then, just as |KN14l Theorem 6.2] showed that having OSE moments with 
approximate matrix multiplication with Frobenius norm error, here we show that having OSE 
moments for larger d implies approximate matrix multiplication with operator norm error. 


B\\ e -S 

U 

d = 1 implies 
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Theorem 1. Given k,e,5 € (0,1/2), let T> be any distribution over matrices with n columns with 
the (e, <5, 2k, i)-OSE moment property for some i > 2. Then, for any A, B, 


n p, (ji(iL4) T (ns) - a t b\\ > e y jm\ 2 + M%/mm 2 + m 2 F /k)') < s (5) 

Proof. We can assume A, B each have orthogonal columns. This is since, via the full SVD, there 
exist orthogonal matrices Ra,Rb such that ARa and BRb each have orthogonal columns. Since 
neither left nor right multiplication by an orthogonal matrix changes operator norm, 

||(IL4) t (ILB) - A t B || = \\(JIAR a ) T {UBR b ) - (AR a ) T BR b ||. 


Thus, we replace A by ARa and similarly for B. We may also assume the columns a\, 02 ,... of 
A are sorted so that 11a.®11 2 > H&i+ilh for all i. Henceforth we assume A has orthogonal columns in 
this sorted order (and similarly for B , with columns b{). Now, treat A as a block matrix in which 
the columns are blocked into groups of size k, and similarly for B (if the number of columns of 
either A or B is not divisible by k, then pad them with all-zero columns until they are, which does 

not affect the claim). Let the spectral norm of the zth block of A be s, = ||_i)-fe-i-iII 2 j and for B 

denote the spectral norm of the zth block as ti = || 6 (j_i).ji;+i]| 2 - These equalities for A, B hold since 
their columns are orthogonal and sorted by norm. We claim )TL s 2 < ||A || 2 + ||A|||,/A; (and similarly 
for 'ffi tf). To see this, let the blocks of A be A\, ...,A' q where s t = ||A'||. Note sf = 11^4)11 < ||A||. 
Also, for z > 1 we have 

s i = i)-fc+i 11 2 — ^ 11 11 2 = 


Thus 

s i ^ II-^IIfA- 

i> 1 

Define C = (nA) r (nH) — A 1 B. Let vu} denote the zth block of a vector v (the ^-dimensional 
vector whose entries consist of entries (z — 1) • k + 1 to i ■ k of v), and C{i},{j} the (z, j)th block of 
C , a k x k matrix (the entries in C contained in the zth block of rows and jth block of columns). 

Now, He'll = sup|| a ,||_|| J/ ||_ 1 x T Cy. For any such vectors x and y, we define new vectors x’ and 
y' whose coordinates correspond to entire blocks: we let x\ = ||xp}||, with y' defined analogously. 
We similarly define C' with entries corresponding to blocks of C, where C[ ■ = ||C{i},{j}ll- Then 
x T Cy < x' T C'y' , simply by bounding the contribution of each block. Thus it suffices to upper 
bound He'll, which we bound by its Frobenius norm ||C'||i?. Now, recalling for a random variable 
X that ||X||^ denotes (E lA/^) 1 ^ and using Minkowski’s inequality (that || • ||^ is a norm for i > 1), 


WWc'Wlh/i 


^||(HA') T (nH')-A' T H'|| 2 


1/2 


<^||H(nA') T (nH')-AfH '|| 2 ||, /2 


(Lemma ED 
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Now, E ||C"||p = ||||C"||f,||^ 2 , implying 


I B 


P ll^ll > e\/(PH 2 + V Z )(PP + L ~r £ -) < P IIC'IIf > e\ (WM 2 + ^)(PII 2 + ~ j~~) 


151 


< 


E IIC'll^ 


eV(PII 2 + ^ vllR " 2 


B II 2 + 


\\B\\ 2 F 


< 


We now discuss the implications of applying Theorem Q] to specific OSE’s. 

Subgaussian maps: In Section IA.1I we show that if II has independent subgaussian entries and 
m = fl((fc+log(l/d))/e 2 ) rows, then it satisfies the (e, 6 , 2k, 0(fc+log(l/d))) OSE moment property. 
Thus Theorem [T] applies to show that such II will satisfy (k, e, d)-AMM. 

SRHT: The SRHT is the matrix product II = SHD where D € M. nxn is n x n diagonal with 
independent ±1 entries on the diagonal, H is a “bounded orthonormal system” (i.e. an orthogonal 
matrix in M nx?l with maxjj \H l3 \ = 0(1/y/n)), and the m rows of S are independent and each 
samples a uniformly random element of [n]. Bounded orthonormal systems include the discrete 
Fourier matrix and the Hadamard matrix; thus such II exist supporting matrix-vector multiplication 
in O(ralogn) time. Thus when computing BA for some nx d matrix A, this takes time 0(nd log to) 
(by applying II to A column by column). In Theorem [9] we show that the SRHT with m = Q((k + 
log(l/(e<5)) log(k/6))/e 2 ) satisfies the (e, 5, 2k, log(A;/5))-OSE moment property, and thus provides 
(fc,£,d)-AMM. Interestingly our analysis of the SRHT in Section [A.2I seems to be asymptotically 
tighter than any other analyses in previous work even for the basic subspace embedding property, 
and even slightly improves the by now standard analysis of the Fast JL transform given in |AC09] . 

Sparse subspace embeddings: The sparse embedding distribution with parameters m, s is as 
follows |CW13lfNN131lKN14] , The matrix n has m rows and n columns. The columns are indepen¬ 
dent, and for each column exactly s uniformly random entries are chosen without replacement and 
set to ±1 /yfs independently; other entries in that column are set to zero. Alternatively, one could 
use the CountSketch |CCF04j : the m rows are equipartitioned into s sets of size m/s each. The 
columns are independent, and in each column we pick exactly one row from each of the s partitions 
and set the corresponding entry in that column to zkl/y/s uniformly; the rest of the entries in the 
column are set to 0. Note BA can be multiplied in time 0(s-nnz(A)), and thus small s is desirable. 

It was shown in jMM131fNN13j . slightly improving jCW13j . that either of the above distributions 
satisfies the (e, 6, k, 2)-OSE moment property for m = Q(k 2 /(e 2 6)), s = 1, and hence (k, e, <5)-AMM 
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(though this particular conclusion follows easily from |KN14l Theorem 6.2]). It was also shown 
in |Cohl 6 j . improving upon |]NN13j . that they satisfy the (e, <5, k, log(/c/5))-OSE moment property, 
and hence also (k, e, J)-AMM, for m = Q(Bk \og(k/5)/e 2 ), s = Q(log B (k/S)/e) for any B > 2 . It is 
conjectured that for B = 0(1), m = f l((k + log(l /5))/e 2 ) should suffice [NN131 Conjecture 14]. 

Remark 2. The work (Cohlh] does not explicitly discuss the OSE moment property for sparse 
subspace embeddings. Rather, |Cohl 6 | bounds Ee E2l / £ = 0{k ) for E = ||(IE7) T (IE7) - I\\ and 
l = log (k/S). Note though for x > 0 and integer i > 1, ar < i\ ■ e x < r ■ e x by Taylor expansion of 
the exponential. Setting x = 2 (.E/e, )Cohl 6 ] thus implies E(2 lE/e) 1 - < $ ■ Ee® 2 ^ = 0(k). Thus 
E E e = 0{k ) • (e/2 ) £ < 5 by choice of £, which is the (e, 6, k, log(fc/<5))-OSE moment property. 

Remark 3. Currently there appears to be a tradeoff: one can either use II such that II A can be 
computed quickly, such as sparse subspace embeddings or the SRHT, but then the number of rows 
m is at least klogk. Alternatively one could achieve the optimal m = 0{k/e 2 ) using subgaussian 
II, but then multiplying by II is slower: 0(mnd) time for A € M nx<i . However, settling for a 
tradeoff is unnecessary. One can actually obtain the “best of both worlds” by composition, i.e. 
the multiplication n = Hi • B 2 of two matrices both supporting AMM. Thus n 2 could be a fast 
matrix providing AMM to low (but suboptimal) dimension, and Hi a “slow” (e.g. subgaussian) 
matrix with the optimal 0{k/e 2 ) number of rows. In fact one can even set n = nin 2 n 3 where 
n 3 is the sparse subspace embedding with 0(k 2 /s 2 ) rows and s = 1, H 2 is the SRHT, and Hi is 
a subgaussian matrix. Then BA will have the desired 0(k/e 2 ) rows and can be computed in time 
0(nnz(A)) + (/(e -0 ^ 1 ^^ 3 + k 2 d))\ see Section IA.3I for justification. 

2.2 Characterization for (k,s, )-AMM via deterministic events 

Here we provide a different characterization for achieving (k, e)-AMM. Without loss of generality we 
assume max{||A|| 2 , ||A|||,/A:} = max{||R|| 2 , ||R|||,/A:} = 1 (so ||A|| 2 , ||R || 2 < 1 and ||A||^, ||R|||. < k). 

Let w,w' each be minimal such that ||A^||, ||R^,|| < e/C' for some sufficiently large constant 
C' (which will be set in the proof of Theorem[2]). It was shown that w,w' = 0(k/e 2 ) in the proof 
of Theorem 3.2 (i.b) in [MZllj . Write the SVDs A w = Ua w E Aw VJ w , B w > = U Bw , ^B w , V Bw , ■ 

For 0 < i < log 2 (l/e 2 ) define D[ as set of all columns of U Aw , U Bw , whose corresponding squared 
singular values (from E Aw , T* B ,) are at least 1/2*. Let D Aw be the set of min{fc, w} largest singular 
vectors from U Aw , and dehne D Bw , similarly. Define Di = D[ U D Aw U D B ^. Let s l denote the 
dimension of span(Dj), and note the s* are non-decreasing. 

Let Si be Sj after rounding up to the nearest power of 2 . Group all i with the same value of Sj 
into groups G 1 , G 2 , ■ ■ ■, G\ 0 g 2 (i/ £ 2 )- For example if for i = 0 , 1 , 2 ,3 the s t are 3,4,15,16 then the s* 
are 4, 4,16,16 and G\ = (0,1}, G 2 = { 2 ,3}. Let Vj be the common value of s* for i in Gj. 

Lemma 3. ^i s */2* < 8 ^- 

Proof. Define s = \D Aw U D B ,\ < 2k and let s' denote the dimension of span(D(). Then the 
above summation is at most X^i( s /2* + s [ /2 l ) <4 k + s'/2*. It thus suffices to bound the second 

summand by Ak. 

Note that we can find a basis for D[ among the columns of U Aw ,U Bw , with corresponding 
squared singular value at least 1/2*, so let a, + bi = s', where a* is the number of columns of U Aw in 
the basis and bi the number of columns of U Bw , in the basis. Then by averaging, if the inequality 
of the lemma statement does not hold then either ]>A ai/ 2* > 2k or bi/ 2* > 2k. Without loss of 
generality assume the former. 
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Consider an arbitrary column of U Aw , and suppose it has squared singular value in the range 
[1/2*, 1/2* -1 ). Then it is in span(D') for all j > i. Its contribution to Yhi a i/ 2* is therefore 
1/2* + l/2* +1 + ... which is at most 2/2* = 1/2* -1 . It follows that JT Oi/2* < 2 k, since the squared 
Frobenius norm of A w is at most k. This is a contradiction to a,/2* > 2k. ■ 

Now we prove the main theorem of this subsection. 


Theorem 2. Suppose that the following conditions hold: 


(1) If w + w' < k, then II is an e/C-subspace embedding for the subspace spanned by the columns 
of A w , B w i. Otherwise ifw + w' > k, then for each 0 < i < log 2 (l/e 2 ), II is an £i/C -subspace 
embedding for span(5j/) with 


£i 



where i' is the largest i with Si in Gj. 


(2) ||IL4tf||, UnS^H <e/C. 

Then Eq. @ holds as long as C is smaller than some fixed universal constant. 


Proof. We would like to bound 


||(nA) r (TLB) - a t b II < H(n^) T ns^ - a t w b w ,\\ + ||(n^) T ns^|| + \\(ua w ) t ub-,\ 


+ ||(n^) T ns w ,|| + ||A^s U) /|| + ||^s tl) ,|| + 


( 6 ) 


Using ||xy|| < ||X|| • ||T|| for any conforming matrices X,Y, we see A < e 2 /C 2 by condition 
(2). Furthermore by the definition of w,w' we know 11^11,115^,11 < e/C', and thus ( + rj + 
0 < 2 e/C' + (e/C') 2 . Note condition (1) implies that II is a (l/2)-subspace embedding for the 
subspace spanned by columns of A w , B w / (by taking i maximal). Thus by both conditions we have 
/3,7 < (e/C)(l + 1/2). 

It only remains to bound a. If w + w' < k, then we are done by condition (1) and Lemma [T| 
Thus assume w + w' > k. Then we have 


\\(UA w ) t UB w/ - AlB w/ \\ = sup \(UUA w Y Aw x,UUB w ,YB w ,y) - (U Aw T, Aw x,UB wl Y Bw ,y)\ 

IMHMK 

Let x,y be any unit norm vectors. Write x = x 1 + x 2 + ... + x b for b = log 2 (l/e 2 ), where 
x l is the restriction of x to coordinates for which the corresponding squared singular values 
in E Aw are in (1/2*, 1/2*" 1 ]. Similarly define y 1 ,...,y b . Then \ (UU Aw Y Aw x,UUB w ,T Bw ,y) - 
(U Aw Y Aw x,UB w ,Y Bw ,y) | equals 


b b 


E ( uu ^A w x\UUB wl Y Bwl y j ) - (U Aw Z Aw x\ U Bw ,Y Bw ,y j ) 
i =1 j =1 


b 

A 

2=1 


n U Aw Y Aw x\UUb wI Yb wI Ey J ) - ( U Aw E Aw x\^Ub w ,Yb w ,^ 


j<i 


j<i 
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(7) 



n u Aw t, Ai 


E 

*<?' 





We bound the first sum, as bounding the second is similar. Note Ua w ^a w x 1 ,Ub w ,^b w i € 

Di. Therefore by property (1) and Lemma[TJ 


( 


nu Aw z Aw x 




< 


£i 

C 2^ 1 )/ 2 


\\y\ 


( 8 ) 


where Eq. (JH]) used that the corresponding v value in property (1) is at most 2sj. Returning to 
Eq. d?D and applying Cauchy-Schwarz and Lemma O 


E 




j<i 


U Av Xa w x 1 , ^2 U Bw ,^B w ,y 
j<i 

/ b 


< 


2e 


< 


Cy/k 

2y/8e 

C 


V 2 = 1 


b 

sE 

i=i 

!/2 / b \ V 2 

i 112 | 


C2( i ~ 1 )/ 2 


ES • E 


\x 


\ i=1 



We thus finally have that Eq. © is at most (2\/8 + 3 )e/C + +(e/C) 2 + 2 e/C’ + (e/C') 2 , which 
is at most e for C, C' sufficiently large constants. ■ 

Now we discuss some implications of Theorem [2] for specific II. 


Example 1: Let II have 0(k/e 2 ) rows forming an orthonormal basis for the span of the columns 
of A W ,B W /. Property (1) is satisfied for every i in fact with e* = 0. Property (2) is also satisfied 
since ||nR^|| < l|n|| ■ HAdII < e, and similarly for bounding ||ILB^, ||. 


Example 2: Let II be a random mxn matrix with independent entries that are subgaussian with 
variance 1/m. For example, the entries of II may be A/"(0,1/m), or uniform in {— 1/y/rn, 1/y/rn}. Let 
m be Q((k + log(l/<5))/e: 2 ). As mentioned in Section lA. 11 such II is an e-subspace embedding for a 
/c-dimensional subspace with failure probability 5. For property (1) of Theorem[2l if w+w' < k then 
we would like II to be an e-subspace embedding for a subspace of dimension at most k, which holds 
with failure probability 5. If w + w' > k then we would like II to be an ej-subspace embedding for 
span(Hj/) for all 1 < i < log 2 (l/e 2 ) simultaneously. Note maxjVj < 2(w + w') = 0(k/e 2 ), and thus 
rnaxj Vj < m. Thus for a subspace under consideration D;/ for i' € Gj, we have failure probability 
8 v 'j/ k f or our choice of m. By construction every Vj is at least k, and the vj increase at least 
geometrically. Thus our total failure probability is, by a union bound, 5 v ^ k < 5 2J 1 = 0(5). 

Property (2) of Theorem [2] is satisfied with failure probability 5 by |RV13l Theorem 3.2]. 
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3 Applications 


Spectral norm approximate matrix multiplication with dimension bounds depending on stable rank 
has immediate applications for the analysis of generalized regression and low-rank approximation 
problems. We also point out to the reader recent applications of this result to kernelized ridge 
regression [YPW15] and k- means clustering [CEM + 15] . 

3.1 Generalized regression 

Here we consider generalized regression: attempting to approximate a matrix B as AX, with A 
of rank at most k. Let Pa be the orthogonal projection operator to the column space of A, with 
P 4 = I — P; then the natural best approximation will satisfy 

AX = P a B. 

This minimizes both the Frobenius and spectral norms of AX — B. A standard approximation 
algorithm for this is to replace A and B with sketches HA and n B, then solve the reduced problem 
exactly (see e.g. [CW09j . Theorem 3.1). This will produce 

x = ((jiAfuA^iuAfuB 
AX = A((IL4) T nA) _ 1 (IL4) T n.B 

= U a ((HUa) T HUa)-\HUa) T HB. 

Below we give a lemma on the guarantees of the sketched solution in terms of properties of n. 

Theorem 3. If n 

1. satisfies the ( k , y/ e/8 )-approximate spectral norm matrix multiplication property for Ua, P a B 

2. is a (1/2) -subspace embedding for the column space of A (which is implied by n satisfying the 
spectral norm approximate matrix multiplication property for Ua with itself) 

then 

\\AX - B|| 2 < (1 + e)|| PaB - B || 2 + (e/k) ■ \\P A B - B\\ 2 F . (9) 

Proof. We may write: 

WAX - B\\l = \\UA{{mJA) T HU A )-\HUA) T HB - B || 2 

= \\Ua{{HUa) T HUa)-\HUa) T H{PaB + P A B ) - P A B - P A B || 2 
= || P A B + U a ({HUa) T HUa)- 1 (HUa) T HP a B - P A B - P A B \\ 2 
= \\Ua{{HUa) T HUa)-\HUa) T HP a B - P A B\\ 2 . 

So far, we have shown that the error depends only on P A B and not P A B (with the third line 
following from the fact that the sketched regression is exact on PaB). Now, in the last line, we 
can see that the two terms lie in orthogonal column spaces (the first in the span of A, the second 
orthogonal to it). For matrices X and Y with orthogonal column spans, \\X + T|| 2 < ||Y|| 2 + ||Y|| 2 , 
so this is at most 

\\Ua{.{HUa) T HU a )-\HUa) T HP a B\\ 2 + ||P^H|| 2 . 
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Spectral submultiplicativity then implies the first term is at most 

(\\u A \\ ■ muuAfnuA)-^ • \\(uua) t up a b\\) 2 . 

\\Ua\\ is 1 , since Ua is orthonormal. ((UUa) 1 UU a )~ X is at most 2 , since II is a subspace embedding 
for Ua- Finally, \\(UUa) t ^Pa^\\ is most 

^Vdl^ll 2 + II^IIfA)(II^ S II 2 + W P AB\\ 2 F /k) = y/ (e/ 8 ) • 2 • (II P A B - Bf + \\P A B - Bf/k). 
Multiplying these together, squaring, and adding the remaining |||| 2 term gives a bound of 

(1 + e)|| P A B — B\\ 2 + (e/k) • \\P A B - B\\ 2 F 

as desired. ■ 

3.2 Low-rank approximation 

Now we apply the generalized regression result from Section 13.11 to obtain a result on low-rank 
approximation: approximating a matrix A in the form where L4 has only k columns 

and both Uk and 14 have orthonormal columns. Here, we consider a previous approach (see e.g. 
[Sar06j ): 

1. Let S = UA. 

2 . Let Ps be the orthogonal projection operator to the row space of S. Let A = AP$. 

3. Compute a singular value decomposition of A, and keep only the top k singular vectors. 
Return the resulting low rank approximation Ak of A. 

It turns out computing Ak can be done much more quickly than computing Ak ; see details in 
ICW09I Lemma 4.3]. 

Let Ak be the exact fc-truncated SVD approximation of A (and thus the best rank-A; approx¬ 
imation, in the spectral and Frobenius norms), and let Uk be the top k column singular vectors, 
and Aj, = A — Ak be the tail. 

Theorem 4. If n 

1. satisfies the (k, \Je/8)-approximate spectral norm matrix multiplication property for Uk,Aj. 

2. is a (1/2)-subspace embedding for the column space ofUk 
then 

11^4 — ^4fc || 2 < (1 + e)||-A — Ak\\ 2 + (e/fc)||H — ( 10 ) 

Proof. Note that this procedure chooses the best possible (in the spectral norm) rank-A: approxi¬ 
mation to A subject to the constraint of lying in the row space of S. Thus, the spectral norm error 
can be no worse than the error of a specific such matrix we exhibit. 

We simply choose the matrix obtained by running our generalized regression algorithm from A 
onto Uk, with n: 

UkdUUkfUUky^UUkfUA 

This is rank-A; by construction, since it is multiplied by Uk, and it lies in the row space of S' = UA 
since that is the rightmost factor. On the other hand, it is an application of the regression algorithm 
to A where the optimum output is Ak (since that is the projection of A onto the space of 14). 
Plugging this into Eq. ([9]) gives the desired result. ■ 
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3.3 Kernelized ridge regression 


In nonparametric regression one is given data y* = /*(xj) + wy for i = 1,..., n, and the goal is to 
recover a good estimate for the function f*. Here the y, are scalars, the x, are vectors, and the Wi 
are independent noise, often assumed to be distributed as mean-zero gaussian with some variance 
a 2 . Unlike linear regression where f*(xi) is assumed to take the form (/3,x) for some vector j3, 
in nonparametric regression we allow f* to be an arbitrary function from some function space. 
Naturally the goal then is to recover some / from the data so that, as n grows, the probability that 
/ is “close” to f* increases at some good rate. 

The recent work [YPW15] considers the well studied problem of obtaining / so that ||/ — f*\\n 
is small with high probability over the noise w, where one uses the definition 


1 

11/ - 9 lln = - - g(Xi)) 2 . 

Ti . 

1=1 

The work [YPW15] considers the case where f* comes from a Hilbert space T~L of functions / such 
that / is guaranteed to be square integrable, and the map x H > f(x) is a bounded linear functional. 
The function / is then defined to be the optimal solution to the Kernel Ridge Regression (KRR) 
problem of computing 


f LS = argmin 
f&u 


1 

2 n 


n 

^2(Vi - + Xn ■ \\f\\n 


( 11 ) 


for some parameter X n . It is known that any T-L as above can be written as the closure of the set 
of all functions 

N 

g(-) = ^ 2 <XiK,zi), (12) 

i =1 


over all a € and vectors z \,..., zn for some positive semidefinite kernel function k. Further¬ 
more, the optimal solution to Eq. (fill) can be expressed as f LS = Yli =l a f S ' x i ) l° r some choice 
of weight vector a LS , and it is known that \\f LS — f*\\ n will be small with high probability, over 
the randomness in w, if X n is chosen appropriately (see |YPW15] for background references and 
precise statements). 

After rewriting Eq. dill) using Eq. C2D and defining a matrix K with K t j = k(xi,Xj), one 
arrives at a reformulation for KRR of computing 


,.LS 


, 1 T r, 1 rp 

a"" = argmin < —a K a - a Ky + X n a Ka 

aeR™ l n 


-K 2 + 2 A n K 
n 


-i 


• ~Ky, 
n 


which can be computed in 0(n 3 ) time. The work [ YPW15 ] then focuses on speeding this up, by 
instead computing a solution to the lower-dimensional problem 


1 


1 


a LS = argmin \ — a 1 TiK 2 lf r a - a T UI\y + X n a T IiKI[ T a 

aeR m l2n n 


1 . 


= ( -uk 2 u t + 2A n n/\n 


T 


n 


-l 


-U Ky 

n 


and then returning as / the function specified by the weight vector a = H T a LS . Note that once 
various matrix products are formed (where the running time complexity depends on the n being 
used), one only needs to invert an m x m matrix thus taking 0(m 3 ) time. They then prove that 
||/ — f*\\ n is small with high probability as long as n satisfies two deterministic conditions (see the 
proof of Lemma 2 |YPW15l Section 4.1.2], specifically equation (26) in that work): 
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• n is a (l/2)-subspace embedding for a particular low-dimensional subspace 

• 11nil11 = 0(||51|) for a particular matrix B of low stable rank ( B is UD2 in IYPW15| ). Note 

||ILB|| = ||(nP) T IIB || 1/2 < (||(ILB) t ILB - b t b\\ + \\b t b \\) 1/2 < \\(UB) t UB-B t b\\ 1/2 +\\b 

and thus it suffices for II to provide the approximate matrix multiplication property for the 
product B t B, where B has low stable rank. 

The first bullet simply requires a subspace embedding in the standard sense, and for the second 
bullet |YPW15] avoided AMM by obtaining a bound on ||ILB|| directly by their own analyses for 
gaussian II and the SRHT (in the gaussian case, it also follows from ;l>\ 13. Theorem 3.2]). Our 
result thus provides a unifying analysis which works for a larger and general class of II, including 
for example sparse subspace embeddings. 


3.4 A>means clustering 


In the works |BZMD15l lCEM + 15 , the authors considered dimensionality reduction methods for 
fc-means clustering. Recall in fc-means clustering one is given n points x \,... ,x n € M d , as well as 
an integer k > 1, and the goal is to find k points yi, ■ ■ •, y* € minimizing 


^min||xi - Vj\\l. 
1=1 


That is, the n points can be partitioned arbitrarily into k clusters, then a “cluster center” should 
be assigned to each cluster so as to minimize sums of squared Euclidean distances of each of the n 
points to their cluster centers. It is a standard fact that once a partition V = {Pi,..., P&} of the 
n points into clusters is fixed, the optimal cluster centers to choose are the centroids of the points 
in each of the k partitions, i.e. y.j = (l/|Pjj) • YliePj Xi - 

One key observation common to both of the works [ BZMD15 . iCEM + 15] is that /c-means cluster¬ 
ing is closely related to the problem of low-rank approximation. More specifically, given a partition 
V = {Pi,..., Pfc}, define the n x k matrix Xp by 


(Xp)ij 


V®’ 

0 , 


if i € Pj 
otherwise 


Let A € M. nxd have rows x\,..., x n . Then the /c-means problem can be rewritten as computing 

argminp|| A - X-pX%A\\p 

where V ranges over all partitions of {1,... ,n} into k sets. It is easy to verify that the non-zero 
columns of Xp are orthonormal, so XpXp is the orthogonal projection onto the column space 
of Xp. Thus if one defines S as the set of all rank at most k orthogonal projections obtained as 
XpXp for some /c-partition P, then the above can be rewritten as the constrained rank-k projection 
problem of computing 

argmin P& s||(J - P)A\\ 2 F . (13) 
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One can verify this by hand, since the rows of A are the points Xi, and the ztli row of PA for 
P = X-pX^p is the centroid of the points in i’s partition in V. 

The work jCEM + 15 showed that if S is any subset of projections of rank at most k (henceforth 
rank-k projections ) and II € M mxrf satisfies certain technical conditions to be divulged soon, then 
if P € S satisfies 

||(I - P)^n T ||^ < 7 • min Pe5 ||(/ - P)AU t \\ 2 f , (14) 


then 


\\(I-P)A\\ 2 F < 


(1 + g ) 

(1-e) 


7 -minp e 5 ||(/-P) J 4 ||p. 


(15) 


One set of sufficient conditions for II is as follows (see jCEM + 15 Lemma 10]). Let A^ denote 
the best rank- A: approximation to A and let A^ = A — A^. Define Z € R dxr for r = 2k by 
Z = V r , i.e. the top r right singular vectors of A are the columns of Z. Define B\ = Z T and 
{A - AZZ T ). Define B € M( n + r ) X(i as having B\ as its first r rows and B 2 as its lower 


B 2 = 


Vk 


U- k h 


n rows. Then [CEM + 15l Lemma 10] states that Eq. (|14l) implies Eq. (1151) as long as 


|| ( UB t ) t (UB t ) - BB t || < e, (16) 

and |||IIS 2 ||p — H^lll’l < efc (17) 

One can easily check ||B || 2 = 1 and ||-B||p < 3k, so the stable rank r(B) is at most 3k. Thus 
Eq. (1161) is implied by the (3fc, e/2)-AMM property for B T , B T , and our results apply to show that 
II can be taken to have m = 0({k + \og(l/5)) / e 2 ) rows to have success probability 1 — 5 for Eq. (fT6l) . 
Obtaining Eq. (1171) is much simpler and can be derived from the JL moment property (see the proof 
of (KN141 Theorem 6.2]). 

Without our results on stable-rank AMM provided in this current work, [CEM + 15] gave a 
different analysis, avoiding jCEM + 15i Lemma 10], which required II to have m = ®{k- \og(l/5)/e 2 ) 
rows (note the product between k and log(l/<5) instead of the sum). 


4 Stable rank and row selection 

As well as random projections, approximate matrix multiplication (and subspace embeddings) by 
row selection are also common in algorithms. This corresponds to setting II to a diagonal matrix S 
with relatively few nonzero entries. Unlike random projections, there are no oblivious distributions 
of such matrices S with universal guarantees. Instead, S must be determined (either randomly or 
deterministically) from the matrices being embedded. 

There are two particularly algorithmically useful methods for obtaining such S. The first 
is importance sampling: independent random sampling of the rows, but with nonuniform sam¬ 
pling probabilities. This is analyzed using matrix Chernoff bounds |AW02| . and for the case of 
/c-dimensional subspace embedding or approximate matrix multiplication of rank-k matrices, it can 
produce 0(k(\ogk)/e 2 ) samples [SSllj . The second method is the deterministic selection method 
given in [BSB12I . often called “BSS”, choosing only 0(k/e 2 ) rows. This still runs in polynomial 
time, but originally required many relatively expensive linear algebra steps and thus was slower in 
general; see [LS15] for runtime improvements. 

The matrix Chernoff methods can be extended to the stable-rank case, making even the log 
factor depend only on the stable rank, using “intrinsic dimension” variants of the bounds as pre¬ 
sented in Chapter 7 of ]Trol5j . Specifically, Theorem 6.3.1 of that work can be applied with each 
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n summands each equal to ^ (^j~ a Jbi — A T B^j , where a* is the ith row of A. and i is random with 
the probability of choosing a particular row i equal to 

_ || ai || 2 + || 6 j || 2 

ft “E,IM 2 + IIM 2 

We here give an extension of BSS that covers low stable rank matrices as well. 

Theorem 5. Given an n by d matrix A such that ||A|| 2 < 1 and \\A\\ 2 F < k, and an e € (0,1), 
there exists a diagonal matrix S with 0(k/e 2 ) nonzero entries such that 

\\(SA) t (SA) - A t A\\ < e 

Such an S can be computed by a polynomial-time algorithm. 

When A t A is the identity, this is just the original BSS result. It is also stronger than Theorem 
3.3 of [KMST10] , implying it when A is the combination of the rows yjN/T • m from that theorem 
statement with an extra column containing the costs, and a constant e. The techniques in that 
paper, on the other hand, can prove a result comparable to Theorem [H but with the row count 
scaling as k/e 3 rather than k/e 2 . 

Proof. The proof closely follows the original proof of BSS. However, for simplicity, and because 
the tight constants are not needed for most applications, we do not include jBSSl2 , Claim 3.6] and 
careful parameter-setting. 

At each step, the algorithm will maintain a partial approximation Z = ( SA) T (SA ) (the matrix 
“A” in |BSS12j ). with S beginning as 0. Additionally, we keep track of upper and lower “walls” X u 
and Xp, in the original BSS these are just multiples of the identity. The final S will be returned by 
the algorithm (rescaled by a constant so that the average of the upper and lower walls is A T A). 
We will maintain the invariants 


tr(A(X u - Z)~ 1 A t ) < 1 (18) 

tr(A(Z - X l )~ 1 A T ) < 1. (19) 

These are the so-called upper and lower potentials from BSS. We also require X u -< Z -< Xp, recall 
M -< M' means that M' — M is positive definite. Note that unlike |BSS12j . here we do not apply 
a change of variables making A T A the identity (to avoid confusion, since that would change the 
Frobenius norm). This is the reason for the slightly more complicated form of the potentials. 

In the original BSS, X u and A) were always scalar multiples of the identity (here, without the 
change of variables, that would correspond to always being multiples of A T A). [BSS12j thus simply 
represented them with scalars. Like BSS, we will increase X u and Xi by multiples of A T A-however, 
the key difference from BSS is that they are initialized to multiples of the identity, rather than 
A t A. In particular, we may initialize X u to kl and Xi to —kl. This is still good enough to get 
the spectral norm bounds we require here (as opposed to the stronger multiplicative approximation 
guaranteed by BSS). 

We will have two scalar values, 5 U and 5i, depending only on e; they will be set later. One step 
consists of 

1. Choose a row a* from A and a positive scalar t, and add tataj to Z (via increasing the i 
component of S). 
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2. Add 5 U A T A to X u and 5iA T A to Xi. 


We will show that with suitable values of 6 U and 61 , for any Z obeying the invariants there always 
exists a choice of i and t such that the invariants will still be true after the step is complete. This 
corresponds to Lemmas 3.3 through 3.5 of BSS. 

For convenience, we define, at a given step, the matrix functions of y 

M u (y) = ({X u +yA T A)-Z)- 1 
M l (y) = (Z-(X l + yA T A))- 1 . 

The upper barrier value, after making a step of ta^af and increasing X u , is 

tr(A((X u + 5 u A t A) - (Z + ta i aJ))- 1 A T ). 


Applying the Sherman-Morrison formula, and cyclicity of trace, to the rank-1 update tataf, 
can be rewritten as 


tr(AM u (5 u )A T ) + 


taf M U (5 U )A T AM u (5 u )ai 
1 - ta[M u (5 u )ai 


this 


Since the function f(y) = tr(AM u (y)A T ) is a convex function of y with derivative 


f'(y) = — tr(AM u (y)A T AM u (y)A T ), 


we have f(6 u ) — /(0) < —S u tr(AM u (S u )A T AM U (6 U )A T ). Then the difference between the barrier 
before and after the step is at most 


taf M u (5 u )A t AM u (5 u )ai 
1 - tajM u (5 u )ai 


5 U tr{AM u (5 u )A T AM u {5 u )A t ). 


Constraining this to be no greater than zero, rewriting in terms of | and pulling it out gives 


1 > aj M U (5 U )A T AM u (5 u )aj 

t ~ S u tr(AM u (6 u )A T AM U (8 U )A T ) 


+ af M u {5 u )ai. 


Furthermore, as long as 4 is at least this, Z will remain below X Ul since the barrier must approach 
infinity as t approaches the smallest value passing X u . 

For the lower barrier value after the step, we get 


tr(A((Z + taiaj) - (Xi + 5iA T A))- l A T ). 


Again, applying Sherman-Morrison rewrites it as 


tr(AMi(6i)A T ) 


taj Mj(6i)A T AMi(5i)aj 
1 + taf Mi(5i)cii 


Again, due to convexity the increase in the barrier from raising X[ is at most 5i times the local 
derivative. The difference in the barrier after the step is then at most 


taf Mj(5i)A T AMi(5i)aj 
1 + taf Mi(5i)ai 


+ Si tr(AM l (S l )A T AM l (5 l )A T ). 
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This is not greater than zero as long as 


1 ^ af M l (6 l )A T AM^Sfidi 

t ~ SMAMtWATAMtfriAT) 


ajMi(6i)ai. 


There is some value of t that works for a* as long as the lower bound for % is no larger than 
the upper bound. To show that there is at least one choice of i for which this holds, we look at the 
sum of all the lower bounds and compare to the sum of all the upper bounds. Summing the former 
over all i gets 


tr(AM u {5u)A T AM u (6 u )A T ) 
5 U tr(AM u {5 u )A T AM U {5 U )A T ) 


+ tr(AM u (6 u )A T ) 


and the latter gets 


tr(AM l (5 l )A T AM l (S l )A T ) 
Si tr{AM l {5 l )A T AM l {5 l )A T ) 


tr(AM l (5i)A T ). 


Finally, note that 


tr(AM u (5 u )A T ) = tr(A((X u + 5 U A T A) - Z)~ l A T ) < tr(A(X u - < 1 

and the lower barrier implies Z — Xi >- A T A, implying that as long as Si < 

tr(AMi{5i)A T ) =tr{A{Z ~{Xi + 5iA t A))~ 1 A t ) < 2 tr{A(Z - X i )~ 1 A T ) < 2. 


Thus, we can always make a step as long as 5 U and 5i are set so that 


1 1 

-M <-2 

5 U ~Si 


and <5; < 7j. This is satisfied by 


5 U — e + 2e 2 
51 = e — 2e 2 . 


Before the first step, X u and Xi can be initialized as kl and —kl, respectively. If the algorithm 
is then run for steps, we have: 

X u = -A t A + 2 kA r A + kl 

£ 

< -A T A + 3kI 

£ 

Xi = -A t A - 2 kA T A - kl 

£ 

Z -A T A - 3 kl. 

£ 

and j^Xi both end up within 3 £l of A T A , so | Z (from \/^S) satisfies the requirements of 
the output for 3e (one can simply apply this argument for e/3). Furthermore, all the computa¬ 
tions required to verify the preservation of invariants and compute explicit ts can be performed in 
polynomial time. ■ 

This obtains more general AMM as a corollary: 
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Corollary 1. Given two matrices A and B, each with n rows, and an e € (0,1), there exists a 
diagonal matrix S with 0{k/e 2 ) nonzero entries satisfying the (k,e)-AMM property for A, B. Such 
an S can be computed by a polynomial-time algorithm. 

Proof. Apply Theorem[5]to a matrix X consisting of the columns of —/= - - -=- appended 

^ J b V2max(||A|| 2 ,||A|| F /Vfc) ^ 

to the columns of ■ — — n -and use the resulting S. 

V2max(\\B\\2,\\B\\ F /Vk) 

Note that X satisfies the conditions of that theorem, since concatenating the sets of columns at 
most adds the squares of their spectral and Frobenius norms. ( SA) T (SB ) — A T B is a submatrix of 
2m a x(\\A\\ 2 ,\\A\\ F /Vk)ma,x(\\B\\ 2 ,\\B\\ F /Vk)((SX) T (SX) - X T X), so its spectral norm is upper 
bounded by the spectral norm of that matrix, which in turn is bounded by the guarantee of 
Theorem [5j ■ 


Acknowledgments 

We thank Jaroslaw Blasiok for pointing out the connection between low stable rank approximate 
matrix multiplication and the analyses in [ YPW15] , 


References 


[AC09] 

[AL13] 

[AW02] 

[Boul4] 


Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and ap¬ 
proximate nearest neighbors. SIAM J. Comput., 39(l):302-322, 2009. 

Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss 
transform. ACM Transactions on Algorithms , 9(3):21, 2013. 

Rudolf Ahlswede and Andreas J. Winter. Strong converse for identification via quan¬ 
tum channels. IEEE Transactions on Information Theory , 48(3):569-579, 2002. 

Jean Bourgain. An improved estimate in the restricted isometry problem. Geometric 
Aspects of Functional Analysis , 2116:65-70, 2014. 


[BSS12] Joshua D. Batson, Daniel A. Spielman, and Nikhil Srivastava. Twice-Ramanujan 
sparsifiers. SIAM J. Comput ., 41(6):1704-1721, 2012. 

[BZMD15] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. 

Randomized dimensionality reduction for k-means clustering. IEEE Transactions on 
Information Theory , 61 (2): 1045-1062, 2015. 

[CCF04] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items 
in data streams. Theor. Comput. Sci., 312(1):3—15, 2004. 

[CEM + 15] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina 
Persu. Dimensionality reduction for k-means clustering and low rank approximation. 
In Proceedings of the flth ACM Symposium on Theory of Computing (STOC), 2015. 
Full version at http://arxiv.org/abs/1410.6801v3, 


20 







[CLL+10] 

[CLM+15] 

[Cohl6] 

[CW09] 

[CW13] 

[DKM06] 

[DKS10] 

[DMM06] 

[DMMW12] 

[FR13] 

[GM13] 

[HMT11] 


Pei-Chun Chen, Kuang-Yao Lee, Tsung-Ju Lee, Yuh-Jye Lee, and Su-Yun Huang. 
Multiclass support vector classification via coding and regression. Neurocomputing, 
73(7-9): 1501-1512, 2010. 

Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, 
and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of 
the 6th Annual Conference on Innovations in Theoretical Computer Science (ITCS), 
pages 181-190, 2015. 

Michael B. Cohen. Simpler and tighter analysis of sparse oblivious subspace em¬ 
beddings. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete 
Algorithms (SODA), to appear , 2016. 

Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in 
the streaming model. In Proceedings of the fist Annual ACM Symposium 
on Theory of Computing (STOC), pages 205-214, 2009. Full version at 

http://researcher.watson.ibm.com/researcher/files/us-dpuoodru/cw09.pdf. 

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and 
regression in input sparsity time. In Proceedings of the f5th ACM Sympo¬ 
sium on Theory of Computing (STOC), pages 81-90, 2013. Full version at 

http://arxiv.org/abs/1207.6365v4. 

Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms 
for matrices I: approximating matrix multiplication. SIAM J. Comput., 36(1):132-157, 
2006. 

Anirban Dasgupta, Ravi Kumar, and Tamas Sarlos. A sparse Johnson-Lindenstrauss 
transform. In Proceedings of the f2nd ACM Symposium on Theory of Computing 
(STOC), pages 341-350, 2010. 

Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for 
1 2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM 
Symposium on Discrete Algorithms (SODA), pages 1127-1136, 2006. 

Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. 
Fast approximation of matrix coherence and statistical leverage. Journal of Machine 
Learning Research, 13:3475-3506, 2012. 

Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive 
Sensing. Applied and Numerical Harmonic Analysis. Birkhauser, 2013. 

Alex Gittens and Michael W. Mahoney. Revisiting the nystrom method for improved 
large-scale machine learning. In Proceedings of the 30th International Conference on 
Machine Learning (ICML), pages 567-575, 2013. 

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with 
randomness: Probabilistic algorithms for constructing approximate matrix decompo¬ 
sitions. SIAM Review, 53(2):217-288, 2011. 


21 



[HR16] 

Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier 
matrices. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete 
Algorithms (SODA), to appear, 2016. 

[HW71] 

David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic 
forms in independent random variables. Ann. Math. Statist., 42(3):1079-1083, 1971. 

[JL84] 

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into 
a Hilbert space. Contemporary Mathematics, 26:189-206, 1984. 

[KMN11] 

Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit johnson- 
lindenstrauss families. In Proceedings of the lfth International Workshop on Random¬ 
ization and Computation (RANDOM), pages 628-639, 2011. 

[KMST10] 

Alexandra Kolia, Yury Makarychev, Amin Saberi, and Shang-Hua Teng. Subgraph 
sparsification and nearly optimal ultrasparsifiers. In Proceedings of the Forty-second 
ACM Symposium on Theory of Computing, STOC ’10, pages 57-66, New York, NY, 
USA, 2010. ACM. 

[KN14] 

Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. 
ACM, 61 (1):4, 2014. 

[KVZ14] 

Anastasios T. Kyrillidis, Michail Vlachos, and Anastasios Zouzias. Approximate ma¬ 
trix multiplication with application to linear embeddings. CoRR, abs/1403.7683, 2014. 

[KW11] 

Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embed¬ 
dings via the Restricted Isometry Property. SIAM J. Math. Anal, 43(3):1269-1281, 
2011. 

[LBKW14] 

Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, and David P. 
Woodruff. Improved distributed principal component analysis. In Proceedings of 
the 27th Annual Conference on Advances in Neural Information Processing Systems 
(NIPS), pages 3113-3121, 2014. 

[LDFU13] 

Yichao Lu, Paramveer Dhillon, Dean Foster, and Lyle Ungar. Faster ridge regression 
via the subsampled randomized Hadamard transform. In Proceedings of the 26th 
Annual Conference on Advances in Neural Information Processing Systems (NIPS), 
2013. 

[LMP13] 

Mu Li, Gary L. Miller, and Richard Peng. Iterative row sampling. In Proceedings 
of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 
pages 127-136, 2013. 

[LP86] 

Frangois Lust-Piquard. Inegalites de Khintchine dans C p (1 < p < 00 ). C. R. Math. 
Acad. Sci. Paris, 303(7):289-292, 1986. 

[LPP91] 

Frangois Lust-Piquard and Gilles Pisier. Noncommutative Khintchine and Paley in¬ 
equalities. Ark. Mat., 29(2):241-260, 1991. 


22 


[LS15] 

Yin Tat Lee and He Sun. Constructing linear sized spectral sparsification in almost 
linear time. In Proceedings of the 56th Annual IEEE Symposium on Foundations of 
Computer Science (FOCS), pages 250-269, 2015. 

[LWM+07] 

Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark 
Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceed¬ 
ings of the National Academy of Sciences, 104(51):20167-20172, 2007. 

[Mahll] 

Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations 
and Trends in Machine Learning, 3(2):123-224, 2011. 

[MM 13] 

Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in 
input-sparsity time and applications to robust linear regression. In Proceedings of the 
45th ACM Symposium on Theory of Computing (STOC), pages 91-100, 2013. 

[MZ11] 

Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and 
approximate matrix multiplication. In Proceedings of the 22nd Annual ACM-SIAM 
Symposium on Discrete Algorithms (SODA), pages 1422-1436, 2011. 

[NN13] 

Jelani Nelson and Huy L. Nguyen. OSNAP: Faster numerical linear algebra algorithms 
via sparser subspace embeddings. In Proceedings of the 54 th Annual IEEE Symposium 
on Foundations of Computer Science (FOCS), pages 117-126, 2013. 

[NN14] 

Jelani Nelson and Huy L. Nguyen. Lower bounds for oblivious subspace embeddings. 
In Proceedings of the International Colloquium on Automata, Languages, and 

Programming (ICALP), pages 883-894, 2014. 

[NPW14] 

Jelani Nelson, Eric Price, and Mary Wootters. New constructions of RIP matrices 
with fast multiplication and fewer rows. In Proceedings of the 25th Annual ACM- 
SIAM Symposium on Discrete Algorithms (SODA), 2014. 

[RHV11] 

Nima Reyhani, Hideitsu Hino, and Ricardo Vigario. New probabilistic bounds on 
eigenvalues and eigenvectors of random kernel matrices. In Proceedings of the Twenty- 
Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 627-634, 

2011. 

[RV13] 

Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian 
concentration. Electronic Communications in Probability, 18:1-9, 2013. 

[Sar06] 

Tamas Sarlos. Improved approximation algorithms for large matrices via random 
projections. In Proceedings of the 41th Annual IEEE Symposium on Foundations of 
Computer Science (FOCS), pages 143-152, 2006. 

[SS11] 

Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. 
SIAM J. Comput., 40(6): 1913-1926, 2011. 

[Troll] 

Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. 
Adv. Adapt. Data Anal, 3(1-2):115-126, 2011. 

[Trol5] 

Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint 
arXiv:1501.01571, 2015. 


23 


[TZ12] Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applica¬ 
tions to linear probing and second moment estimation. SIAM J. Comput., 41(2):293- 
331, 2012. 

[Woo 14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and 
Trends in Theoretical Computer Science , 10(1-2):1-157, 2014. 

[YPW15] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: 
Fast and optimal non-parametric regression. CoRR, abs/1501.06195, 2015. 

Appendix 

A OSE moment property 

In the following two subsections we show the OSE moment property for both subgaussian matrices 
and the SRHT. 

A.l Subgaussian matrices 

In this section, we show the OSE moment property for distributions satisfying a JL condition, 
namely the JL moment property. This includes matrices with i.i.d. entries that are mean zero and 
subgaussian with variance 1/m. 

Definition 4. KMX 11, Let V be a distribution over M m><n . We say V has the (e, S,p)-JL moment 
property if for all x £ M n of unit norm, 

E . \\\nxf - 1\*> < e* ■ 6. 

II r^U 

The following theorem follows from the proof of Lemma 8 in the full version of jCW13j . We 
give a different proof here inspired by the proof of [FR131 Theorem 9.9], which is slightly shorter 
and more self-contained. A weaker version appears in [Sar061 Lemma 10], where the size bound on 
X is ( Cd/e) d for a constant C > 1 instead of simply C d . 

Theorem 6. Let U £ M nx<i with orthonormal columns be arbitrary. Then there exists a set X C M n , 

| X | < 9 d , each of norm at most 1 such that 

\\{UU) T {UU) -I || < 2 ■ sup |||ILe || 2 - 1| 

x&X 

Proof. We will show that if sup^g^ |||ILc || 2 — 1| < e/2 then || (IR7) T (IR7) — 7|| < e, where e > 0 
is some positive real. Define A = (LR/) r (IK7) — I. Since A is symmetric, 

||A|| = sup \x T Ax\ = sup | (Ax,x) \ 

||cc|| = 1 ||a;||=l 

Let T 7 be a finite 7 -net of £ d , i.e. T 7 C l d an d f° r every x € of unit norm there exists a y € T 7 
such that \\x — y \\2 < 7 . As we will see soon, there exists such a T 7 of size at most (1 + 2/'y) d . We 
will show that if II satisfies the JL condition on T 1 = {Uy : y € T 1 / 4 } with error e/2, then ||A|| < e; 
that is, (1 — e/ 2 )||x ||2 < ||ILr ||2 < (1 + e/ 2 )||x ||2 for all x £ T'. 
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Let x be a unit norm vector that achieves the sup above, i.e. ||A|| = | ( Ax,x ) |. Then, letting y 
be the closest element of T 7 to x, 


PH = I (Ax,x) | 

= I (Ay, y) + (A(x + y),x-y)\ 

< | + IPII ' \\x + y\\ ■ IP — 2/11 

<f+ 27MII- 

Rearranging gives ||A|| < e/(2(l — 27 )), which is e for 7 = 1/4. 

Now we must show that we can take |T 7 | < (1 + 2/'y) d . The following is a standard cover¬ 
ing/packing argument for bounding metric entropy. Imagine packing as many radius-( 7 / 2 ) I 2 balls 
as possible into R d , centered at points with at most unit norm and such that these balls do not 
intersect each other. Then these balls all fit into a radius-(l + 7 / 2 ) I 2 ball centered at the origin, 
and thus the number of balls we have packed is at most the ratio of the volume of a (1 + 7 / 2 ) ball 
to the volume of a 7/2 ball, which is ((1 + 7 / 2 )/( 7 / 2 )) rf = (1 + 2/'y) d . Now, take those maximally 
packed radius-( 7 / 2 ) balls and double each of their radii to be radius 7 . Then every point in the 
unit ball is contained in at least one of these balls by the triangle inequality, which is exactly the 
property we wanted from T 7 (T 7 is just the centers of these balls). To see why every point is in 
at least one such ball, if some x € of unit norm is not contained in any doubled ball then a 
7 / 2 -ball about x would be disjoint from our maximally packed 7/2 balls, a contradiction. ■ 

Lemma 4. If V satisfies the ( e,5,p)-JL moment property, then V satisfies the (2e, 9 d 5,d,p)-OSE 
moment property 

Proof. By Theorem [6l there exists a subset X C R n of at most 9 d points such that 

E ||(m7) T (n[/) - I\\ p < 2 P • Esup |||rLr|| 2 - l| p 

xex 

< 2 P ■ Y E ll|np 2 - i| p 

xex 

<2 P -9 d -e p -5 
= ( 2e) p • 9 d 6. 


It is known that if V is a distribution over R mxn with m = f!(log(l /5)/e 2 ) and for II ~ V, 
the entries of II are independent subgaussians with mean zero and variance 1 /m, then V has the 
(e/2, 6, 0(log(l/<5)))-JL moment property [KMNll j. Thus such a matrix has the ( s,5,d,@(d + 
log(l/(5)))-OSE moment property for 6 < 2~ d by Lenmra|4j 

A.2 Subsampled Randomized Hadamard Transform (SRHT) 

Recall the SRHT is the mx n matrix n = (1/y/rn) ■ SHD for n a power of 2 where D has diagonal 
entries aq,..., a n that are independent and uniform in { — 1,1}, H is the unnormalized Hadamard 
transform with ] = (—1 )(*■■?} (treating i, j as elements of the vector space e!/ S2 n ), and S' is a 
sampling matrix. That is, the rows of S are independent, and each row has a 1 in a uniformly 
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random location and zeroes elsewhere. A similar construction is where S is an n x n diagonal 
matrix with S l t = r\i being independent Bernoulli random variables each of expectation m/n (so 
that, in expectation, S selects m rows from HD). We will here show the moment property for 
this latter variant since it makes the notation a tad cleaner, though the analysis we present holds 
essentially unmodified for the former variant as well. 

Our analysis below implies that the SRHT provides an e-subspace embedding for d-dimensional 
subspaces with failure probability 5 for m = 0(e~ 2 (d + log(l/(e<5))) log(d/5)). This is an improve¬ 
ment over analyses we have found in previous works. The analysis in [Trollj only considers constant 
e and 5 = 0(l/d) and for these settings achieves m = 0((d+log n) log d), which is still slightly worse 
than our bound for this setting of e,5 (our bound removes the logn and achieves any 1 /poly(d) 
failure probability with the same m). The analysis in [LDFU13] only allows failure probabilities 
greater than n/e d . They show failure probability <5 + n/e d is achieved for m = 0(dlog(d/5)/e 2 ), 
which is also implied by our result if m < n (which is certainly the case in applications for the 
SRHT to be useful, since otherwise one could use the n x n identity matrix as a subspace embed¬ 
ding). The reason for these differences is that previous works operate by showing HDU has small 
row norms with high probability over D\ since there are n rows, some logarithmic dependence on 
n shows up in a union bound. After this conditioning, one then shows that S works. Our analysis 
does not do any such conditioning at all. Interestingly, such a lossy conditioning approach was done 
even for the case d = 1 jACf)9j . As we see below, these analyses can be improved (essentially the 
logn terms that appear from the conditioning approach can be very slightly improved to logm). 

Our main motivation in re-analyzing the SRHT was not to improve the bounds, but simply to 
clearly demonstrate that the SRHT satisfies the OSE moment property. The fact that our moment 
based analysis below (very slightly) improved m was a fortunate accident. Before we present our 
proof of the OSE moment property for the SRHT, we state a theorem we will use. For a random 
matrix M, we henceforth use ||M|| p to denote (E ||M||g ) 1/,p where ||M||s is the Schatten-p norm, 
i.e. the £ p norm of the singular values of M. 

Theorem 7 (Non-commutative Khintchine inequality [LP86llLPP91] l. Let X \,..., X n be fixed real 
matrices and a±,... ,a n be independent Rademachers. Then 

Vp > 1, II < Vp ■ max III (]T XiXm 8p , ||(^T Xf 

i \ i i 

We will also make use of the Hanson-Wright inequality. 

Theorem 8 (Hanson-Wright [HW7l] ). For (<Tj) independent Rademachers and A symmetric, 

Vp > 1, \\(7 T Aa — Ea T A<j\\p < y/p ■ ||A|| F +p- ||A||. 

We now present our main analysis of this subsection. 

Theorem 9. The SRHT satisfies the (e, 5, d,p)-moment property for p = log (d/5) as long as 
m > £~ 2 (dlog(d/5) + log (d/5) log (m/5)) ~ e~ 2 (d + log(l/(e<5)) log {d/5)). 

Proof. For a fixed U € M nxrf with orthonormal columns, we would like to bound 

E II — (SHDU) t (SHDU) -I\\ p . 

OLtf m 
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Since p > log d we have 
„ 1 


m 


(, SHDU) 1 (SHDU) - 


- \\-(SHDU) 1 (SHDU)-I\\ Sp 

m 


( 20 ) 


by Holder’s inequality. Also, let z ±,..., z n be the rows of HDU , as column vectors, so that 

1 n 

— (SHDU) T (SHDU) = — V rHZizf. (21) 

777 , m * 


m 


i =1 


Note also Yli z i z I = (HDU) T HDU = n ■ I for any D, so the identity matrix is the expectation, 
over rj , of the right hand side of Eq. (121|i for any D. Thus we are left wanting to bound 


— y rjiZizf - E — y 


where the rj) are identically distributed as the rji but independent of them. Below we use || f(X)\\ lp(x) 
to denote (E_\- |/(X)| p ) 1 ' /p . Also we assume p is an integer multiple of 4, so that ||A||g for real sym¬ 
metric A equals (tr(A p )) l / p and ||A.||g p/2 = (tr(A p / 2 )) 2 / p . Thus for (ci) independent Rademachers, 


i^i || p 


— y rnZizf-I\\p = || — y rpzizj - E — y rfii 

m m rf m 

i i i 

= nii-y^y-E-y^yii iP( ,)iu P(a) 

m rf rn " v ' 

i i 

< ^llll y Vi z i z I - y , n'i z i z J\\Lv(r 1 ,ri')\\Lv{ a ) (Jensen’s inequality) 

i i 

— ■ \\Y\(m-ri'i) z i z J\\p 

i 

— • II crdpi — rj'Azizf lip (equal in distribution) 

i 

yy (jjriiZizf \\ p (triangle inequality) 
i 

— ■ ||(y m\\ z i\\l ■ z i z I) 1/2 \\p (Theorem ED 

i 

< ^ ‘ E ((™axrii\\zi\\Z) Hr((^piZiZ?) p/2 )\ (\\M\\ p Sp = tr(M p )) 

\ i / 

< • II max77^11 ^Holiy 2 • (E tr((S^ r\iZiz[) p ^ 2 ) 2 ) 1 ^ 2p (Cauchy-Schwarz) 

i 

< • || max^H^lloliy 2 • (d • Ktr((S^ r]iZizJ) p )) l ^ 2p (Cauchy-Schwarz) 

i 

< — • || ^xpi\\zi\\l\\l /2 • II y PiZizJtJ 2 (since d 1/p < 2 ) 

mi 


( 22 ) 


m 

2 

m 


< 


( 23 ) 
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maxr/ilp. 


'i\\l\\ l J 2 ■ ( d 1/p + 1|— rjiZizJ - I\\ l J 2 ) (triangle inequality) 


(24) 


Eg. (03j) follows since if /% are the singular values of M = Yli r ]i z i z Ti then tr(M p / 2 ) 2 = 
(Si/^f^ 2 ) 2 ) an d the rank of M, and hence the number of summands /%, is at most d. Letting 
Q = W^Y^iViZiZi - I\\ l p 2 an d R = s/p/m ■ II max* rn\\zi\\ 2 \\]/ 2 ■, combining Eq. ([22]) and Eq. (J2U) 

Q 2 <R + RQ 


implying that for some fixed constant C > 0, we have Q 2 — CRQ — CR < 0. This implies that Q is 
at most the larger root of the associated quadratic equation, i.e. Q < rnax{\/i?, R}, or equivalently 



l 


< max{i?, R 2 } 


(25) 


It only remains to bound R, which in turn amounts to bounding || max* |||| 1 IIp^ 2 • Define 


q = rnaxjp, logm}, and note 


lip — 


Then 


max77j||zj||!|L = E max77?(||,Zj|||) 9 

\a,V i 


< 


1/9 

1/9 

1/9 


HI ) 9 


< In- max E ^(H^Hl) 9 


i a,r] 


1/9 


1/9 


= ( n ■ max(Eryf) • (E(||zj|||) 9 ) ) (a,rj independent) 

\ i V a J 

/ \ 1/9 

= ( m ■ maxEdl^ll^) 13 ) J 

< 2 • max HH^Illllg (m 1//<J < 2 by choice of q) 
i 

= 2 • max ||or U{U i a\\ q 

i 

= 2 • max(d + \\a T UiUf a — Ea T C/jf7) r a|| ( j) (triangle inequality) (26) 

i 

where U t is the matrix with ( Ui)k,j = Hi,k ■ Uk,j■ Of particular importance for us is the identity 
lljU t = /. Then by Eq. 051) and Theorem [H] 


max77j||zj|||||g < d + y/q ■ \\UiUf ||f + q • || UiU? 

I 

= d + \fqd + q 
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3 

< - ■ (d + q) (AM-GM inequality) 

so that 



which when combined with Eq. (|25|) gives 

W—^riiZiZ? - I\\ p < J— -(d + q) + — -(d + q). 

m *— J V m m 

i 

Thus the OSE moment property is satisfied by our choices of m,p in the theorem statement. ■ 

A.3 Composing dimensionality reducing maps supporting AMM 

As discussed in Remark [3J to obtain both a good number of rows for II as well as fast multiplication 
for IL4, ILB, one may wish to set II as the composition II = nill 2 , where IIi has the correct number 
mi = 0(k/e 2 ) of rows (e.g. a matrix of subgaussian entries), whereas II 2 maps to a small but 
suboptimal number m 2 of rows (e.g. the SRHT) but supports fast embedding to compute II 2 A. We 
show here that composing maps each supporting AMM yields a final map also giving AMM. 

As discussed in Corollary [H without loss of generality we can assume A = B. Also, as discussed 
in Remark [TJ we can focus on achieving Eq. ([3|) where the number of rows of II should depend on 
the stable rank r and not rank r of A. The key is to note the following simple triangle inequality: 

||(il4) t (il4) - a t a \\ < ||(n 1 n 2 A) T (n 1 n 2 A) - (n 2 A) r (n 2 A)|| + ||(n 2 A) T (n 2 A) - a t a \\ . ( 27 ) 

v-„-„-' 

a p 

The results of this work show that to achieve the desired (3 < e||A|| 2 , it suffices that the number 
of rows of II 2 need only depend on r and not r, as desired. The trouble is that for a, the number of 
rows of IIi w iH need to depend on the stable rank f' of II 2 A and not r. Furthermore, the error will 
be a < e||Il 2 A|| 2 and not a < e||A|| 2 . Thus, we must obtain good bounds on both f' and ||Il 2 A.||. 
To achieve this, note 

||n 2 A|| = ||(n 2 A) T (n 2 A)|| 1/2 = p|| ± ||(n 2 A) T (n 2 A) - a t a \\ 1 / 2 ( 28 ) 

and 

l|n 2 A|| F = tr((n 2 A) T (n 2 A)) 1/2 = ||a|| f ± ||(n 2 A) T (n 2 A) - a t a\\ f ( 29 ) 

Thus, if we condition on /3 = ||(Il 2 A) r (n 2 A) — A 7 A|| < e||A|| 2 (which we already discussed 
above), then indeed we have 11II 2 A.11 = 0(||A||) by Eq. (1281) . Also, }KN14l Theorem 6.2] implies 
||(n 2 A) T (n 2 A) — A T A\\p < e||A|||, with probability 1 — 5 as long as II comes from a distribution 
satisfying the (0(e), 5, £)-JL moment property for some i > 2 (which is just the (0(e),6, l,^)-OSE 
moment property in the terminology of this work). If this holds, then ||Il 2 .A|| F = ©(||A|| F ) by 
Eq. (1291) . and thus r' = @(r), as desired. Then overall, we have that the left hand side of Eq. (1271) is 
at most e ■ ||II 2 A.|| 2 + e ■ ||A|| 2 = 0(e) ■ ||A|| 2 as desired, in which both IIi and II 2 need only provide 
AMM with error e for matrices both of stable rank 0(f). 
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Remark 4. An even slicker argument that works in the case when IR, II 2 are both drawn from 
distributions satisfying the (e, 6, k, ^)-OSE moment property is to observe that the distribution of 
the product IIin 2 itself satisfies the OSE moment property. Indeed, letting \\Z\\ p denote (E \Z\ p ) l ' p 
for a scalar random variable Z, and letting U € W nxk denote a matrix with orthonormal columns, 

WWiU-^UfU^U - I\\\\ e < sS^WW^UfWt (LemmaED 

= e5 1/e \\\\{U2U) T U 2 U\\\\ e 

< e<5 1//£ (l + \\\\(U 2 U) 7 n 2 U — 7||||^) (triangle inequality) 

< s5 1/£ { 1 + e6 1/e ) 

In the first line we used that when A = B in Lemma El IR need only satisfy the OSE moment 
property with parameter k instead of 2k (since then the span of the columns of both A and B has 
dimension at most k). Thus the distribution of the product IIiL^ satisfies the (0(e), 0(6), k, I?)-OSE 
moment property. 
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